{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classification\n",
    "\n",
    "Machine learning models rely on optimizing an objective function, by seeking\n",
    "its minimum or maximum. It is important to understand that this objective\n",
    "function is usually decoupled from the evaluation metric that we want to\n",
    "optimize in practice. The objective function serves as a proxy for the\n",
    "evaluation metric. Therefore, in the upcoming notebooks, we will present the\n",
    "different evaluation metrics used in machine learning.\n",
    "\n",
    "This notebook aims at giving an overview of the classification metrics that\n",
    "can be used to evaluate the predictive model generalization performance. We\n",
    "can recall that in a classification setting, the vector `target` is\n",
    "categorical rather than continuous.\n",
    "\n",
    "We will load the blood transfusion dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")\n",
    "data = blood_transfusion.drop(columns=\"Class\")\n",
    "target = blood_transfusion[\"Class\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by checking the classes present in the target vector `target`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "target.value_counts().plot.barh()\n",
    "plt.xlabel(\"Number of samples\")\n",
    "_ = plt.title(\"Number of samples per classes present\\n in the target\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the vector `target` contains two classes corresponding to\n",
    "whether a subject gave blood. We will use a logistic regression classifier to\n",
    "predict this outcome.\n",
    "\n",
    "To focus on the metrics presentation, we will only use a single split instead\n",
    "of cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, shuffle=True, random_state=0, test_size=0.5\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use a logistic regression classifier as a base model. We will train\n",
    "the model on the train set, and later use the test set to compute the\n",
    "different classification metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "classifier = LogisticRegression()\n",
    "classifier.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classifier predictions\n",
    "\n",
    "Before we go into details regarding the metrics, we will recall what type of\n",
    "predictions a classifier can provide.\n",
    "\n",
    "For this reason, we will create a synthetic sample for a new potential donor:\n",
    "they donated blood twice in the past (1000 cm\u00b3 each time). The last time was\n",
    "6 months ago, and the first time goes back to 20 months ago."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_donor = pd.DataFrame(\n",
    "    {\n",
    "        \"Recency\": [6],\n",
    "        \"Frequency\": [2],\n",
    "        \"Monetary\": [1000],\n",
    "        \"Time\": [20],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the class predicted by the classifier by calling the method\n",
    "`predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classifier.predict(new_donor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this information, our classifier predicts that this synthetic subject is\n",
    "more likely to not donate blood again.\n",
    "\n",
    "However, we cannot check whether the prediction is correct (we do not know the\n",
    "true target value). That's the purpose of the testing set. First, we predict\n",
    "whether a subject will give blood with the help of the trained classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_predicted = classifier.predict(data_test)\n",
    "target_predicted[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accuracy as a baseline\n",
    "\n",
    "Now that we have these predictions, we can compare them with the true\n",
    "predictions (sometimes called ground-truth) which we did not use until now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_test == target_predicted"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the comparison above, a `True` value means that the value predicted by our\n",
    "classifier is identical to the real value, while a `False` means that our\n",
    "classifier made a mistake. One way of getting an overall rate representing the\n",
    "generalization performance of our classifier would be to compute how many\n",
    "times our classifier was right and divide it by the number of samples in our\n",
    "set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "np.mean(target_test == target_predicted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This measure is called the accuracy. Here, our classifier is 78% accurate at\n",
    "classifying if a subject will give blood. `scikit-learn` provides a function\n",
    "that computes this metric in the module `sklearn.metrics`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "accuracy = accuracy_score(target_test, target_predicted)\n",
    "print(f\"Accuracy: {accuracy:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`LogisticRegression` also has a method named `score` (part of the standard\n",
    "scikit-learn API), which computes the accuracy score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classifier.score(data_test, target_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion matrix and derived metrics\n",
    "\n",
    "The comparison that we did above and the accuracy that we calculated did not\n",
    "take into account the type of error our classifier was making. Accuracy is an\n",
    "aggregate of the errors made by the classifier. We may be interested in finer\n",
    "granularity - to know independently what the error is for each of the two\n",
    "following cases:\n",
    "\n",
    "- we predicted that a person will give blood but they did not;\n",
    "- we predicted that a person will not give blood but they did."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import ConfusionMatrixDisplay\n",
    "\n",
    "_ = ConfusionMatrixDisplay.from_estimator(classifier, data_test, target_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The in-diagonal numbers are related to predictions that were correct while\n",
    "off-diagonal numbers are related to incorrect predictions\n",
    "(misclassifications). We now know the four types of correct and erroneous\n",
    "predictions:\n",
    "\n",
    "* the top left corner are true positives (TP) and corresponds to people who\n",
    "  gave blood and were predicted as such by the classifier;\n",
    "* the bottom right corner are true negatives (TN) and correspond to people who\n",
    "  did not give blood and were predicted as such by the classifier;\n",
    "* the top right corner are false negatives (FN) and correspond to people who\n",
    "  gave blood but were predicted to not have given blood;\n",
    "* the bottom left corner are false positives (FP) and correspond to people who\n",
    "  did not give blood but were predicted to have given blood.\n",
    "\n",
    "Once we have split this information, we can compute metrics to highlight the\n",
    "generalization performance of our classifier in a particular setting. For\n",
    "instance, we could be interested in the fraction of people who really gave\n",
    "blood when the classifier predicted so or the fraction of people predicted to\n",
    "have given blood out of the total population that actually did so.\n",
    "\n",
    "The former metric, known as the precision, is defined as `TP / (TP + FP)` and\n",
    "represents how likely the person actually gave blood when the classifier\n",
    "predicted that they did. The latter, known as the recall, defined as\n",
    "`TP / (TP + FN)` and assesses how well the classifier is able to correctly\n",
    "identify people who did give blood. We could, similarly to accuracy,\n",
    "manually compute these values, however scikit-learn provides functions to\n",
    "compute these statistics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import precision_score, recall_score\n",
    "\n",
    "precision = precision_score(target_test, target_predicted, pos_label=\"donated\")\n",
    "recall = recall_score(target_test, target_predicted, pos_label=\"donated\")\n",
    "\n",
    "print(f\"Precision score: {precision:.3f}\")\n",
    "print(f\"Recall score: {recall:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These results are in line with what was seen in the confusion matrix. Looking\n",
    "at the left column, more than half of the \"donated\" predictions were correct,\n",
    "leading to a precision above 0.5. However, our classifier mislabeled a lot of\n",
    "people who gave blood as \"not donated\", leading to a very low recall of around\n",
    "0.1.\n",
    "\n",
    "## The issue of class imbalance\n",
    "At this stage, we could ask ourself a reasonable question. While the accuracy\n",
    "did not look bad (i.e. 77%), the recall score is relatively low (i.e. 12%).\n",
    "\n",
    "As we mentioned, precision and recall only focuses on samples predicted to be\n",
    "positive, while accuracy takes both into account. In addition, we did not look\n",
    "at the ratio of classes (labels). We could check this ratio in the training\n",
    "set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_train.value_counts(normalize=True).plot.barh()\n",
    "plt.xlabel(\"Class frequency\")\n",
    "_ = plt.title(\"Class frequency in the training set\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the positive class, `'donated'`, comprises only 24% of the\n",
    "samples. The good accuracy of our classifier is then linked to its ability to\n",
    "correctly predict the negative class `'not donated'` which may or may not be\n",
    "relevant, depending on the application. We can illustrate the issue using a\n",
    "dummy classifier as a baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "dummy_classifier = DummyClassifier(strategy=\"most_frequent\")\n",
    "dummy_classifier.fit(data_train, target_train)\n",
    "print(\n",
    "    \"Accuracy of the dummy classifier: \"\n",
    "    f\"{dummy_classifier.score(data_test, target_test):.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the dummy classifier, which always predicts the negative class `'not\n",
    "donated'`, we obtain an accuracy score of 76%. Therefore, it means that this\n",
    "classifier, without learning anything from the data `data`, is capable of\n",
    "predicting as accurately as our logistic regression model.\n",
    "\n",
    "The problem illustrated above is also known as the class imbalance problem.\n",
    "When the classes are imbalanced, accuracy should not be used. In this case,\n",
    "one should either use the precision and recall as presented above or the\n",
    "balanced accuracy score instead of accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import balanced_accuracy_score\n",
    "\n",
    "balanced_accuracy = balanced_accuracy_score(target_test, target_predicted)\n",
    "print(f\"Balanced accuracy: {balanced_accuracy:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The balanced accuracy is equivalent to accuracy in the context of balanced\n",
    "classes. It is defined as the average recall obtained on each class.\n",
    "\n",
    "## Evaluation and different probability thresholds\n",
    "\n",
    "All statistics that we presented up to now rely on `classifier.predict` which\n",
    "outputs the most likely label. We haven't made use of the probability\n",
    "associated with this prediction, which gives the confidence of the classifier\n",
    "in this prediction. By default, the prediction of a classifier corresponds to\n",
    "a threshold of 0.5 probability in a binary classification problem. We can\n",
    "quickly check this relationship with the classifier that we trained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_proba_predicted = pd.DataFrame(\n",
    "    classifier.predict_proba(data_test), columns=classifier.classes_\n",
    ")\n",
    "target_proba_predicted[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_predicted = classifier.predict(data_test)\n",
    "target_predicted[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since probabilities sum to 1 we can get the class with the highest probability\n",
    "without using the threshold 0.5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "equivalence_pred_proba = (\n",
    "    target_proba_predicted.idxmax(axis=1).to_numpy() == target_predicted\n",
    ")\n",
    "np.all(equivalence_pred_proba)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default decision threshold (0.5) might not be the best threshold that\n",
    "leads to optimal generalization performance of our classifier. In this case,\n",
    "one can vary the decision threshold, and therefore the underlying prediction,\n",
    "and compute the same statistics presented earlier. Usually, the two metrics\n",
    "recall and precision are computed and plotted on a graph. Each metric plotted\n",
    "on a graph axis and each point on the graph corresponds to a specific decision\n",
    "threshold. Let's start by computing the precision-recall curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import PrecisionRecallDisplay\n",
    "\n",
    "disp = PrecisionRecallDisplay.from_estimator(\n",
    "    classifier, data_test, target_test, pos_label=\"donated\", marker=\"+\"\n",
    ")\n",
    "disp = PrecisionRecallDisplay.from_estimator(\n",
    "    dummy_classifier,\n",
    "    data_test,\n",
    "    target_test,\n",
    "    pos_label=\"donated\",\n",
    "    color=\"tab:orange\",\n",
    "    linestyle=\"--\",\n",
    "    ax=disp.ax_,\n",
    ")\n",
    "plt.xlabel(\"Recall (also known as TPR or sensitivity)\")\n",
    "plt.ylabel(\"Precision (also known as PPV)\")\n",
    "plt.xlim(0, 1)\n",
    "plt.ylim(0, 1)\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "_ = disp.ax_.set_title(\"Precision-recall curve\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p class=\"last\">Scikit-learn will return a display containing all plotting element. Notably,\n",
    "displays will expose a matplotlib axis, named <tt class=\"docutils literal\">ax_</tt>, that can be used to add\n",
    "new element on the axis.\n",
    "You can refer to the documentation to have more information regarding the\n",
    "<a class=\"reference external\" href=\"https://scikit-learn.org/stable/visualizations.html#visualizations\">visualizations in scikit-learn</a></p>\n",
    "</div>\n",
    "\n",
    "On this curve, each blue cross corresponds to a level of probability which we\n",
    "used as a decision threshold. We can see that, by varying this decision\n",
    "threshold, we get different precision vs. recall values.\n",
    "\n",
    "A perfect classifier would have a precision of 1 for all recall values. A\n",
    "metric characterizing the curve is linked to the area under the curve (AUC)\n",
    "and is named average precision (AP). With an ideal classifier, the average\n",
    "precision would be 1.\n",
    "\n",
    "Notice that the AP of a `DummyClassifier`, used as baseline to define the\n",
    "chance level, coincides with the number of samples in the positive class\n",
    "divided by the total number of samples (this number is called the prevalence\n",
    "of the positive class)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prevalence = (\n",
    "    target_test.value_counts()[\"donated\"] / target_test.value_counts().sum()\n",
    ")\n",
    "print(f\"Prevalence of the class 'donated': {prevalence:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The precision and recall metric focuses on the positive class, however, one\n",
    "might be interested in the compromise between accurately discriminating the\n",
    "positive class and accurately discriminating the negative classes. The\n",
    "statistics used for this are sensitivity and specificity. Sensitivity is just\n",
    "another name for recall. However, specificity measures the proportion of\n",
    "correctly classified samples in the negative class defined as: TN / (TN + FP).\n",
    "Similar to the precision-recall curve, sensitivity and specificity are\n",
    "generally plotted as a curve called the Receiver Operating Characteristic\n",
    "(ROC) curve. Below is such a curve:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import RocCurveDisplay\n",
    "\n",
    "disp = RocCurveDisplay.from_estimator(\n",
    "    classifier, data_test, target_test, pos_label=\"donated\", marker=\"+\"\n",
    ")\n",
    "disp = RocCurveDisplay.from_estimator(\n",
    "    dummy_classifier,\n",
    "    data_test,\n",
    "    target_test,\n",
    "    pos_label=\"donated\",\n",
    "    color=\"tab:orange\",\n",
    "    linestyle=\"--\",\n",
    "    ax=disp.ax_,\n",
    ")\n",
    "plt.xlabel(\"False positive rate\")\n",
    "plt.ylabel(\"True positive rate\\n(also known as sensitivity or recall)\")\n",
    "plt.xlim(0, 1)\n",
    "plt.ylim(0, 1)\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "_ = disp.ax_.set_title(\"Receiver Operating Characteristic curve\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This curve was built using the same principle as the precision-recall curve:\n",
    "we vary the probability threshold for determining \"hard\" prediction and\n",
    "compute the metrics. As with the precision-recall curve, we can compute the\n",
    "area under the ROC (ROC-AUC) to characterize the generalization performance of\n",
    "our classifier. However, it is important to observe that the lower bound of\n",
    "the ROC-AUC is 0.5. Indeed, we show the generalization performance of a dummy\n",
    "classifier (the orange dashed line) to show that even the worst generalization\n",
    "performance obtained will be above this line.\n",
    "\n",
    "Instead of using a dummy classifier, you can use the parameter `plot_chance_level`\n",
    "available in the ROC and PR displays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axs = plt.subplots(ncols=2, nrows=1, figsize=(15, 7))\n",
    "\n",
    "PrecisionRecallDisplay.from_estimator(\n",
    "    classifier,\n",
    "    data_test,\n",
    "    target_test,\n",
    "    pos_label=\"donated\",\n",
    "    marker=\"+\",\n",
    "    plot_chance_level=True,\n",
    "    chance_level_kw={\"color\": \"tab:orange\", \"linestyle\": \"--\"},\n",
    "    ax=axs[0],\n",
    ")\n",
    "RocCurveDisplay.from_estimator(\n",
    "    classifier,\n",
    "    data_test,\n",
    "    target_test,\n",
    "    pos_label=\"donated\",\n",
    "    marker=\"+\",\n",
    "    plot_chance_level=True,\n",
    "    chance_level_kw={\"color\": \"tab:orange\", \"linestyle\": \"--\"},\n",
    "    ax=axs[1],\n",
    ")\n",
    "\n",
    "_ = fig.suptitle(\"PR and ROC curves\")"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}